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Abstract 

We describe two main classes of one-sided trigonometric and hyperbolic Jacobi- 
type algorithms for computing eigenvalues and eigenvectors of Hermitian matrices. 
These types of algorithms exhibit significant advantages over many other eigenvalue 
algorithms. If the matrices permit, both types of algorithms compute the eigenvalues 
and eigenvectors with high relative accuracy. 

We present novel parallelization techniques for both trigonometric and hyperbolic 
classes of algorithms, as well as some new ideas on how pivoting in each cycle of 
the algorithm can improve the speed of the parallel one-sided algorithms. These 
parallelization approaches are applicable to both distributed-memory and shared- 
memory machines. 

The numerical testing performed indicates that the hyperbolic algorithms may 
be superior to the trigonometric ones, although, in theory, the latter seem more 
natural. 

Keywords: Hermitian matrices, eigenvalues, Jacobi algorithm, parallelization 
AMS subject classifications: 65F15, 65Y05, 65Y20, 68W10 

1 Introduction 

Among a variety of diagonalization algorithms for Hermitian and symmetric matrices, 
the Jacobi algorithm is the oldest and the simplest one, but is often considered too slow 
for practical usage. However, Jacobi-type algorithms have recently returned to the focus 
of active research, mostly due to their accuracy properties and inherent amenability for 
parallelization. 

Demmel and Veselic [5] showed that both the one-sided and the two-sided Jacobi algo- 
rithms are accurate in the relative sense for positive definite matrices. To explain precisely 
what the "accuracy in relative sense" means, we need to define the scaled condition Ks{H) 
of a Hermitian positive definite matrix H. Let H be scaled such that 

A:^D-^HD-\, i? = diag((/in)'/',...,(/in„)'/'), (1.1) 
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where n is the order of H. Then Ks{H) := K2(A) = ||A||2|| A"i||2. According to [23], A is 
nearly optimally scaled, in the sense that 



K2{A) < nmmK2{^HA), 

over all diagonal matrices A. The previous inequality implies K2iA) < nK2{H), but it 
frequently happens that K2{A) ^ H2{H). 

Demmel and Veselic in |8 also proved that the eigenvalues Ai of H can be computed 
with a small relative error, i.e., they replaced the standard relative bound 

< ef{n)n2{Hl 

by 

< £/W^2(A), (1.2) 

where e is the machine precision, and / is a slowly increasing function of the order n. 

In the case of a positive definite H, the two-sided Jacobi algorithm can be viewed as 
the singular value decomposition (SVD) of a factor G of the matrix H 

H = GG*. 



Moreover, for a positive definite H, the eigenvalues of G*G and GG* are equal, so we can 
choose to transform either GG* or G*G. The sequence of two-sided unitary transforma- 
tions which diagonalizes GG* needs to be applied only from one side, say the right-hand 
side, i.e., on G* . 

All the angles in this process are calculated from the temporary iterate :— Gfi*^, 
but applied only on G|. If the iterates Gfil*^ tend to a diagonal matrix for ^ — > oo, then 
G*g tends to a matrix with orthogonal (but not orthonormal!) columns. The squared 
norms of the columns of the final are the eigenvalues of H. A similar fact holds for 
the matrix G*G and its factor G. 

li H E C"xn jg an indefinite matrix of rank m, m < n, the diagonalization task 
is harder to deal with. The Jacobi-type algorithms that possess the relative accuracy 
property work on a factor of H . The factor of H is computed by using Slapnicar's mod- 
ification (see [21]) of the Bunch-Par lett Hermitian indefinite factorization with pivoting 

PHP^^GJG*, J = diag(jii,...,j™™), (1.3) 

where P is a permutation matrix, G G C"^™ is a block lower trapezoidal matrix with 
diagonal blocks of order one or two, and jii G { — 1, 1}, for 1 < i < m. If H is nonsingular, 
G is a block lower triangular matrix. 

Similarly to the positive definite case, the indefinite Jacobi diagonalization can be 
viewed as the hyperbolic SVD of the matrix G from (|1.3p with respect to the signature 
matrix J (see, for example, |26|1. 

G = c/sy*, 

where U is an orthogonal matrix, E is a diagonal matrix with non-negative entries and 
V* is a J-orthogonal matrix, i.e., V* JV = J. 

This SVD can be obtained either by orthogonal transformations applied to G from 
the left, or by hyperbolic ones applied from the right. The former case is known as the 
one-sided Jacobi algorithm |25l |S] , and the latter is known as the hyperbolic one-sided 
Jacobi algorithm [24'. 

If the factor G is well-scaled, then both algorithms are accurate in the relative sense ["251 
[S]. Slapnicar [52] generalized the proof of the relative accuracy [8 to the case of the 
hyperbolic Jacobi algorithm. Namely, if iJ is a nonsingular indefinite matrix, and the 
relation p.ip is replaced by 

A = D-^HD-\ D = diag((/iii)i/2, . . . , {Knf''), 
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where H = V is the positive definite polar factor of H, then the accuracy of the 
hyperbohc Jacobi algorithm is essentially given by (|1.2p . From the bounds for the one- 
sided Jacobi algorithms, it is obvious that the matrix H permits an accurate computation 
of eigenvalues if the scaled condition K2{A) is small. 

On the other hand, Dopico, Koev and Molera proved that the one-sided trigono- 
metric Jacobi algorithm computes the eigenvalues of H := XDX*, where D is a diagonal 
nonsingular matrix, with a relative bound given by 



A. 

In this paper we develop a sequence of modifications, applicable to both trigonometric 
and hyperbolic one-sided algorithms, aimed at increasing the speed of parallel imple- 
mentations of these algorithms. The combined effect is a speedup of approximately 30% 
over the straight-forward parallel realizations of the trigonometric and hyperbolic [2D] 
algorithms. 

The rest of the paper is organized as follows. In Section 2 we briefly describe the 
sequential Jacobi algorithms with emphasis on the details needed in a parallel imple- 
mentation. Section 3 is devoted to detailed descriptions of the corresponding parallel 
algorithms. Specially, we discuss the advantages and the drawbacks of the algorithms, 
and present the modifications mentioned above. In the final section, we give some of the 
results of the performed numerical testing. In the Appendix we derive the error bounds 
for the eigenvector computation in the trigonometric case. 



2 Block algorithms, pivot strategies and parallelization 
2.1 Pointwise algorithms 

The pointwise Jacobi algorithms, that diagonalize a 2 x 2 matrix in each step, both the 
two-sided and the one-sided ones, are well known and described in details, for instance, 

in Eiiiiiaiiiiin]. 

The one-sided trigonometric algorithm operates on G* from the right, choosing pivots 
from H — GJG* (see ['9'). The hyperbolic algorithm operates from the right on G, 
diagonalizing the matrix pair {A, J) := {G*G, J) (see |24|1. 

To unify the notation for both the trigonometric and the hyperbolic Jacobi-type algo- 
rithms, let A° — H,G° — G* in the trigonometric, and A° = A, G° — G in the hyperbohc 
algorithm. Fig. [T] shows the relation of the matrix A° to its factor G° . The matrix A°p is 
called the pivot (sub)matrix. 



A° G° 




Figure 1: Block columns G°p :— [G°, G°] in G° correspond to a square block Ap in A° . 

We summarize the pointwise trigonometric and hyperbolic algorithms below. Both 
algorithms operate on a chosen pair G°p := [g°,gj] of ordinary columns of G° . 
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Trigonometric Jacobi: 

1. Find the next pivot pair {i,j) and com- 
pute 



A°p := Hp 



Ki 














9j. 



J [gt g*j] 



2. Diagonalize A°p by a single trigonometric 
rotation, 



Qp = 



cos if e sm tp 
-e~'°sin(y3 cos</3 



(2.1) 



i.e., find cos(^, sin(/p and e*^ such that 





h'i. 



3. Apply the rotation to the columns g* and 
9h 

[{g*i)' {g*j)'] = [g*i g*i] Qp- 



Hyperbolic Jacobi: 

1. Find the next pivot pair and com- 

pute 









gl 










P'ij 






gj. 



[gi gj 



2. If the signs in Jp (the 2x2 submatrix of 
J that corresponds to A°p) are the same, 
choose the trigonometric rotation (|2.ip , 
otherwise choose the hyperbolic rotation 

cosh </3 e"* sinh 95 
e^'"sinh(/5 coshtp 



to diagonalize A°p, 
QpA°pQp 



a'ii 
a'. 



3. Apply the rotation to the columns gi and 

gj, 

[g^ g'j] = [g-i gj] Qp- 



2.2 Blocked algorithms 

In order to speed up the algorithms, a pivot pair of columns should be replaced by a 
pivot block that contains a pair of block columns (see Fig. [TJ. The generalization of the 
pointwise algorithm to a blocked one transforms the whole pivot block in each step, and 
this can be done in two different ways: 

• by reducing its off-diagonal norm (the so-called block- oriented algorithm) |13| ; 

• by diagonalizing it (the so-called full block algorithm) [T3] . 

The main steps of the block-oriented and the full block algorithm are as follows. 



Block-oriented algorithm: 

1. Before each block sweep, for all block 
columns G° form the matrix A°p), 



GiJG*, trigonometric case, 
G*Gi, hyperbolic case. 

Annihilate the upper triangle of A°p, only 
once. Apply the accumulated rotations 
to G°. 

Find the next pair of pivot block columns 
G° and G°j. Form the pivot block A°p, 

A-ii Ai 
_{A°j)* A] 

and annihilate each element of A°j only 
once. 

Apply the accumulated rotations to the 
block columns G° and Gj. 



A°p 



Full block algorithm: 

1. Find the next pair of pivot block columns 
G° and G?. Form the pivot block A°p, 



A°p 



A° A° 

■^ii ij 

(Aij)* A°j 



and diagonalize it. 
2. Apply the accumulated rotations to the 
block columns G° and G°j. 



The pivot matrices A°jj and A°p are processed in a one-sided manner, i.e.. they are first 
factorized. and then the corresponding pointwise one-sided algorithm is applied to the 
factor. 
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2.3 Block factorizations 



A nai've implementation of the blocked Jacobi algorithm would apply each transformation 
directly to the tall and skinny block 

Go . l/^o /^ol 
p — l^i 

(a factor of Ap), which is very inefficient. In practice, each block Gp is preprocessed to 
obtain a square factor Rp of Ap. Then. Rp is transformed, and these transformations 
are accumulated in a "work" matrix Qp. Finally, G°p is postmultiplied (only once) by the 
accumulated Qp, so the "update" is now a BLAS 3 operation. 

This kind of an accumulated application of transformations influences the overall al- 
gorithm speed tremendously. In principle, there are two ways to compute the required 
square factor R°p. 

1. By forming A°p and computing a suitable Hermitian factorization afterwards, i.e., 
the Cholesky factorization in the hyperbolic case 

A°p = R*R, 

where R is an upper triangular matrix, and the Hermitian indefinite factorization 
in the trigonometric case 

A°p ^ P^R*JpRP, (2.2) 

where P is a permutation matrix, i? is a block upper triangular matrix with diagonal 
blocks of order 1 or 2, and Jp is a diagonal signature matrix containing the inertia 
of Ap. Handling the extra permutation in this case is detailed in Section [3] 

2. By the QR-like factorization of G°p, i.e., the ordinary QR factorization in the hy- 
perbolic, and the hyperbolic QR factorization [TS" in the trigonometric algorithm. 

The former case is faster, as it involves the multiplication of two block columns and 
the factorization of a relatively small square matrix, while the QR-like approach has a 
significant overhead of preserving the input block by copying, and applying the computed 
transformations to the tall and skinny original matrix. The details of preprocessing of Ap 
blocks will be described in Section [3J for each type of the Jacobi algorithms. 

2.4 The parallelization 

The two different pivot blocks, 

G°p:=[G°G°] and := [G° G°] , 
with i ^ j and k ^ I, can be independently and simultaneously transformed if 

{z,j}n{fc,O = 0. 

This property, together with blocking, is the basis of parallel implementations of these 
algorithms, in a sense that the independent blocks are assigned as independent tasks to 
computational processes. 

The iterations of the Jacobi algorithm are usually called sweeps (or cycles). In the 
block algorithms, we distinguish block (or outer) sweeps over pivot blocks, and inner 
sweeps of the pointwise algorithm inside each of the pivot blocks. 

A block ( outer) pivot strategy is the order in which the pivot blocks A°p are chosen. 
An inner pivot strategy is the order of annihilations inside all pivot blocks. 

In a block sweep, the pivot blocks are processed according to the block-oriented or 
the full block algorithm. Recall, in an inner sweep of the block-oriented algorithm, each 
off-diagonal element of A° is annihilated only once (or finitely many times, if the chosen 
pivot strategy is quasicyclic, see |10|[T^[T7] V In the full block algorithm each pivot block 
A°p is diagonalized. 

There are many choices of pivot strategies, but only a few of them are suitable for 
parallel computation and provably convergent. Our choice of the outer strategy is the 
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modulus pivot strategy, described in (TH]. This strategy, with the row-cychc inner strategy, 
is weakly equivalent to the block row-cyclic strategy (with the same inner strategy) , and, 
therefore, ensures the convergence of the Jacobi algorithm, as shown in recent works |13l 

ttH. 

The pointwise modulus strategy simultaneously annihilates the elements of A° on each 
antidiagonal. If n is even, and the antidiagonal is denoted by an even number, e.g., by 
2 in Fig. [2l the modulus strategy annihilates only n/2 — 1 elements. If the additional 
element is also annihilated (presented in a lighter hue) , the strategy is called the modified 
modulus strategy. The same principle is used for the corresponding block strategies that 
operate on blocks, instead of elements. 



\ 3 


4 

/ 




/ 

n 


/ 

1 










/ 

2 


















/ 

n-2 




n-1 



Figure 2: Annihilations of A° using the (modified) modulus pivot strategy. 

Fig. [3] shows the block layouts for odd and even sweeps of the modulus pivot strategy. 
While the modulus pivot strategy can be inefficient in the sequential implementation 
(frequent cache spilling), it is ideal for parallel implementations. 



012210 201210 220110 212010 211200 210120 



Odd sweep. 



210012 021012 002112 010212 011022 012102 

I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I IJ 

Even sweep. 

Figure 3: Modulus strategy for 6 blocks. A pair of block columns denoted by the same 
number label comprises a pivot block in each step. The white blocks are skipped (not 
transformed) in the modulus, but are processed in the modified modulus strategy. 



3 Parallel implementations of the Jacobi algorithms 

The trigonometric and the hyperbolic, both block-oriented and full block, Jacobi algo- 
rithms are parallelized in terms of single-threaded processes, communicating through the 
MPI (Message Passing Interface) stack. The implementations are independent of the un- 
derlying memory and network architecture, scaling from the symmetric multiprocessing 
(SMP) , over the non-uniform memory architecture (NUMA) , to the clusters of intercon- 
nected machines, provided only that the chosen MPI distribution efficiently supports the 
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target architecture (s). 

By p > 2 we denote the number of parallel processes involved in a running instance 
of any of our algorithms. These processes are arranged in a one-dimensional torus (ring), 
realized in MPI as a one-dimensional cyclic Cartesian virtual topology. 

Initially, a block (with two block columns) 

Go [/^° /^o 1 
P — I'^r+l '^2p-r\ 

is assigned to a process with MPI rank r, where < r < p. The partitioning of G° into 
pivot blocks G°p should be maintained as uniform as possible. In every step and at each 
process, the widths of the block columns in a block are allowed to differ by at most one, 
i.e., the widths of blocks G°p of any two processes differ by at most two. 

The ring topology is natural for the modulus pivot strategy (and vice versa), because 
it implies shifting of only one block column per process, cyclically after each step in a 
sweep. The cycle direction is defined with respect to the parity of a sweep number: in odd 
sweeps, the process r sends a block column to the process (r — 1) mod p, and receives one 
from the process (r + 1) modp, while in even sweeps, the process r sends a block column 
to the process (r + 1) mod p, and receives one from the process (r — 1) mod p. 



Algorithm 3.1: Modulus pivot strategy 



Initialization: Process r computes the indices of the initial column blocks 

{i_blk,j_blk). An auxilliary pair of indices {ip,jp) is used for the 
determination of the pivot indices in all subsequent steps: 

ip=r+l; i_blk=ip; jp = 2-p—r, j_blk = jp. 

Description: Routine Mext_Pair computes the indices of the next pivot pair 

{i_blk,j_blk) in the process r. It also tells whether to send receive 
the block column i_ blk or j_ blk. Routine Send_Receive determines 
the rank of the process from which the next block will be received, 
and the rank of the process to whom the next block will be sent. 



Next_Pair (r); 
begin 

if {ip + jp) > 2 ■ p then 

snd_ blk = i_ blk; ip= ip+1; 
if ip = jp then 

I ip = ip-p; jp= ip; 
end if 

i_ blk = ip; rcv_ blk = i_ blk, 
else 

snd_ blk — j_ blk; jp — jp + i; 
j_ blk = jp; rcv_ blk = j_ blk; 
end if 
end 



Send_Receive (r); 
begin 

if {nsweep mod 2) > then 

snd_ rnk = (p + r — 1) mod p; 
rcv_ rnk = {p + r+1) mod p; 
else 

snd_ rnk = {p + r+ 1) mod p; 
rcv_ rnk = {p + r— I) mod p; 
end if 
end 



To further elaborate the communication pattern in the modulus pivot strategy, let us 
describe the "process map" for a parallel Jacobi step. Instead of static processes and block 
columns being swapped among them, consider the block-partition of the matrix A° , and 
assign a process to each block Ap to be transformed in a given step of the (modified) 
modulus strategy. At the start of a sweep, all the processes lie on the antidiagonal blocks. 
When transitioning from one step to another, the processes move downwards, i.e., the 
row indices of the assigned blocks are incremented, while the column indices remain the 
same. 

In an even step, a single (but each time different) process hits a diagonal block and 
changes for itself the above map traversing rule. The process keeps the column index of 
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its block, and takes the row index unoccupied by other processes. In the subsequent steps, 
such a process keeps its block row index (say, s) fixed, and traverses the matrix A° row- 
wise, starting from the block (s, s + 1), and incrementing the block column index in each 
step. After the last step of a sweep the processes are positioned again on the antidiagonal, 
but their order from the start of a sweep is now reversed (see routine Next_Pair in 
Algorithm [XT]) . 

By looking at the process map we can tell whether a process replaces its first or its 
second block column in the after-step communication. While the process keeps moving 
downwards, it exchanges its first block column. When it hits a diagonal block, and 
afterwards, it exchanges its second block column. The efficient implementation of this 
traversing pattern and communication rules is given in the step transitioning procedure 
Send_ Receive in Algorithm 13. II The block column exchanges are realized with MPI, by 
using the mpi_sendrecv routine (one could also choose mpi_sendrecv_replace). 



3.1 Trigonometric Jacobi algorithms 

We can now switch our focus to the detailed description of the operations performed by 
a single process, before the block column interchanges. 

Formation of the pivot submatrix in the block case is the same as in the pointwise 
case, except that the ordinary columns g* and g* are now replaced by the block columns 
G* and G*, owned by a particular process. 



J \g* g*] 



(3.1) 



The nonsingular pivot submatrix Hp is then factored by the Bunch-Parlett Hermitian 
indefinite factorization with complete pivoting jHl [H] , as in {\2.2\ . 

Note that the pivoting can change the affiliation of an individual column from one block 
column to the other. To prevent this, and to ensure the convergence of the algorithm, the 
columns should be restored to their original positions after the factorization. 

If we apply the sequential one-sided trigonometric Jacobi algorithm to the factor F = 
RP of the original Hp, i.e.. 

Hp = F*JpF = {P'^R*)Jp{RP), 

we get the trigonometric full block (TF) and the trigonometric block-oriented (TB) al- 
gorithms. The similar naming convention will be used for all algorithms: the first letter 
denotes the type of the algorithm (trigonometric/hyperbolic), the second one denotes 
the type of blocking (block-oriented/full block), while the remaining letters describe the 
pivoting strategy. 

If the one-sided Jacobi algorithm is applied directly to R, instead of F, the original 
column arrangement is lost, but some speedup is gained, especially in the full block (TFC) 
case. After the transformation of R (either one sweep, or the full diagonalization) the rows 
of the unitary matrix Up, that transforms R, are reordered according to the permutation 
used in the Bunch-Parlett factorization. 

Though the convergence of TFC and TBC algorithms is not yet proven, we firmly 
believe that, due to the nature of complete pivoting in the Bunch-Parlett factorization, 
no column exchanges are needed after finitely many sweeps. As a result of pivoting, after 
the completion of the algorithm, the eigenvalues are sorted decreasingly in their absolute 
values. 

However, there are two possible drawbacks in blocked trigonometric algorithms. The 
first one is that, even if H is nonsingular, the pivot submatrix Hp need not be. 



Example 1. Let 



H 



10 11 
111 
1110 
110 1 
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The matrix H is of rank A, with eigenvalues —1, 1, 1, 3. The Hermitian indefinite 
factorization of H (with complete pivoting) gives H = P^R*JRP, where 



R 



1 
1 

00 



1 


1 ' 


1 


1 


v/3 




V2 


V2 


1 


1 


V2 


V2-I 



J = diag(l, 1,-1,1) 



P = /. 



If R is partitioned in 4 (block) columns, the modulus strategy, in its first step, takes the 
first and the last column as the first pivot pair, and the middle two columns as the second 
pivot pair. The pivot matrices in these two processes are the same. 



A° — 



1 1 
1 1 



and obviously singular. 



□ 



In the singular case, the Bunch-Parlett factorization gives the upper trapezoidal factor 
R which is unsuitable for the trigonometric Jacobi method. The solution is either the two- 
sided Jacobi algorithm applied on PHpP^ . or the QR factorization of R* . In the latter 
case we obtain a full rank factor of a lower dimension. After the orthogonalization of this 
smaller factor, we have to assemble the full unitary matrix that diagonalizes Hp. 

If the eigenvectors of H are also needed, the algorithm requires the additional global 
matrix U, distributed in the same way as the block columns of G° = G* , starting with 
U = I. In each step, in addition to Gp, this matrix U is locally multiplied by Up in order 
to accumulate the eigenvectors of H. 

Instead of the Bunch-Parlett factorization in (|2.2p , the hyperbolic QR factorization (or 
JQR, for short) [18 can be applied, as well. Though a bit more accurate factor is produced, 
we are only interested in accumulating U p during the diagonalization, with columns as 
orthogonal as possible. Even with a factor of lower accuracy, U p can still be orthogonal 
to the machine precision, and, thus, useful for the approximate diagonalization of Hp. 
The JQR factorization, as already discussed, has a bookkeeping overhead, rendering it 
too inefficient, with no significant final accuracy gained. 

The joint outline of all described parallel one-sided trigonometric algorithms is given 
as Algorithm 13.21 

For the pointwise algorithm, Dopico, Koev and Molera in [9, Theorem 7] showed that 
the computed eigenvector matrix U satisfies 



\U ~U\\f = 0(emax{; 



,3/2, 



Nr}). 



where n is the order of H, r is the rank of H, while Np is the number of rotations used. 
If the algorithm is parallelized, under the standard IEEE model of the floating-point 
arithmetic, after I stages, we obtain the linearized bound 

\\U -U\\2 < Ue£y/2^. 

The proof of this fact can be found in the Appendix. 

In principle, the eigenvector matrix U can also be determined from the starting factor 
G* , and the hyperbolic S VD of G* , which can be written as 



G*U = V^. 



(3.2) 



Just note that G*U = VI] is the final matrix computed by the algorithm. Multiplication 
of (1221) from the left by (V^E)* J yields 



(FS)*JG*L/ = EV = A, (3.3) 
where A is a diagonal matrix of the computed eigenvalues. From (j3.3p it follows that 

U* = A-^{VI)*JG*. 
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Algorithm 3.2: Iterative part of trigonometric algorithms TB, TBC, TF. TFC 



Description: Diagonalization of the matrix H — GJG* by the parallel one-sided 
trigonometric Jacobi algorithms. Assumption: G* is obtained by the 
Hermitian indefinite factorization of H with complete pivoting, and 
then reordered to get J partitioned as J = diag(/„pos, —In-npos)- 
The matrix G* and the unitary matrix of eigenvectors U are initially 
divided into 2p block columns, distributed so that the pair of blocks 
(r + 1, 2p — r) resides in process r. In each process the first block 
column is denoted by index i, and the second one by j. 

Trigonometric_Jacobi (G, J, n); 
begin 
repeat 

// compute the pivot submatrix Hp 
compute Hp from (|3.ip : 

// compute the Hermitian indefinite factorization of Hp 
Hermit ian_Indefinite_Factorization_with_Complete_Pivoting(7Jp, 

i?); 

if algorithm = TB or TF then 
I reorder the columns of R to their initial positions; 
end if 

if algorithm = block- oriented then 

if this is the first step in a sweep, annihilate all the off-diagonal elements 
of Hp; 

for all the other steps, annihilate only the elements of the block Hij 
from ((XT)) : 

accumulate the matrix Up of applied unitary transformations; 
else if algorithm = full block then 
diagonalize Hp; 

accumulate the unitary matrix Up that diagonalizes Hp; 
end if 

if algorithm = TBC or TFC then 

apply the permutation from the Hermitian indefinite factorization to the 

rows of Up; 
end if 

[G* G*] = [G* G*] ■ Up; 

// accumulate eigenvectors 

[U, Uj] = [U, U,] ■ Up; 

send/receive one of the blocks in [G* G*] and [Ui Uj] to/from the 
neighboring process according to the modulus strategy; 
until convergence; 
end 
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However, the relative accuracy of such a computation of eigenvectors (i.e., a bound on 
angles between the computed eigenvectors and the corresponding eigenvector subspaces) 
has not yet been proven, nor extensively tested. 



3.2 Hyperbolic Jacobi algorithms 

Hyperbolic Jacobi algorithms, instead of only trigonometric rotations, use both trigono- 
metric and hyperbolic rotations, depending on the signs in J. The algorithms diagonalize, 
by congruence, a definite matrix pair {A° , J) :— {G*G, J). 

Each process owns two block columns, Gj and Gj, and the pivot block is 

!] [G, G^] . (3.4) 

Let Jp be a part of J corresponding to Ap. Then the definite matrix pair (74p,t/p) is 
transformed per process. 

Due to the full column rank of the initial matrix G° = G, the matrix [G,;, Gj] also has 
full column rank, and Ap is a positive definite matrix. Thus, it can be factored by the 
Cholesky factorization (with or without pivoting): 

P^ApP = R*R. 



A% 



A, 



A-ii 






A* 


An. 





The rest of the computation is very similar to the trigonometric case. We need to apply 
the sequence of rotations to the columns of R. When the signs of diagonal elements 
in Jp match, a trigonometric rotation is applied, otherwise a hyperbolic one is used. 
The parameters of a hyperbolic rotation are computed as described in ^2,- Despite 
being impossible in theory, tanhtp = ±1 can occur in finite precision arithmetic. In these 
extremely rare cases, a smaller angle ip' is used for the hyperbolic transformation (usually, 
tanhif' = ±0.9). 

Here, the rotations are accumulated, per step, in the local matrix Vp*, to be ap- 
plied later to the block Gp, but the global V^* no longer needs to be maintained, since 
GV~* = UYi. The final eigenvector matrix U is obtained by normalizing the columns of 
the resulting matrix GV^*. 

The hyperbolic algorithms explained so far, the full block (HF) and the block-oriented 
(HB), can be tuned even further. If we follow the idea from the trigonometric case 
(presented in [7 for the Jacobi SVD algorithm) and use the Cholesky factorization with 
diagonal pivoting, some speedup is gained in the full block case (HFC), but lost it in 
the block-oriented case (HBC). Similarly to the sorted trigonometric case (TBC, TFC), 
the eigenvalues are computed in decreasing order by their absolute value. This happens 
because "sorting" (by pivoting) can spoil the quadratic convergence by mixing columns 
with almost the same hyperbolic norms, which correspond to different signs in J (see [ifS] 
for details). 

On the other hand, we can do the Cholesky factorization in two parts, respecting the 
positive and negative signs in J. Suppose that Ap has the following block structure 



An Ai2 
A* A 

^12 ^22 



(3.5) 



and the square diagonal blocks An and A22 correspond to the positive/negative signs 
in Jp. If A22 in (|3.5p does not exist (Jp — I), the whole block Ap is factored by the 
Cholesky factorization with diagonal pivoting. If An is non-existent (Jp = ^I), the 
same is done, but afterwards the columns are reversed to keep the column norms in 
non-decreasing order. Else, the block An is factored by the Cholesky factorization with 
diagonal pivoting, P^A^^P^ = Rl^R^^, so p.Sp becomes 



\PF 1 




An A12 




>i ■ 




Rn 




I 




R-n R12 


/ 




AI2 A22 




/ 




RI2 I. 




S 




I 



(3.6) 
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From p.6p it follows that RI1R12 = P1A12, and R12 can be computed by solving this 
triangular linear system. The Schur complement 

^ ^ ^22 ^ ^12-^12 

is also a Hermitian positive definite matrix, so we can factorize it by the Cholesky factor- 
ization with diagonal pivoting 

P2 ~ ^22^22- 



(3.7) 



Combining p.7p and 



we have 



Pi 



Rii 
P12 1 



P. 



P^SP, 



PT 



Pll P12 

I 



I 




Rii 




^11 P12P2 




I 


P2. 




/ 2 ^12 ^22. 




R22 




P^ 



The final sign-pivoted Cholesky factorization of Ap is given by 



PI 



P'^ 
^2 



A, 



Pi 



P. 



R*sRs, 



Rc 



Pii P12P2 



R 



The block column structure of the factor Rs can be written as 



R, 



[Ps,i P. 



S,2 \ J 



R 



s,i 



Pii 




R 



S,2 



22 



R12P2 
P22 



After the factorization, we need to reverse the order of columns in Rs,2 — the first 
column is swapped with the last, the second one with the penultimate, and so on. This 
reveresal tries to maintain the eigenvalues sorted in non-decreasing order, thus, ensuring 
the quadratic convergence in the case of multiple or clustered eigenvalues [11 . Moreover, 
the computed eigenvalues are sorted "for free". 

Algorithm l3 . 3l computes the specially pivoted Cholesky factorization used in algorithms 
HBSC and HFSC. 

The joint outline of all described parallel one-sided hyperbolic Jacobi algorithms is 
given in Algorithmg^l The full block (HFSC) and the block-oriented (HBSC) algorithms, 
are faster than their respective diagonally pivoted counterparts HFC and HBC, the former 
one more than the latter. 



4 Numerical testing 

Numerical testing has been conducted on a cluster of 16 blade servers, each equipped with 
two dual-core Intel Xeon 5150 CPUs at 2.66 GHz, and with 8 GB of RAM. 

The software used consists of Intel Fortran and C++ compilers and Math Kernel 
Library 11.0.074 for EM64T on GNU/Linux, and Open MPI 1.3. Our programs are 
single-threaded and, mostly, Fortran 77 and 90 based. 

We have tested nonsingular real symmetric matrices of orders n from 1000 to 16000, 
in steps of 1000. In each test matrix, the elements of the upper triangle have been 
pseudorandomly generated, with the uniform distribution of elements in [—5, 5], by using 
the LAPACK testing routine dlarnd. This kind of generation produces matrices with 
tightly clustered eigenvalues, varying from 10""^ to 10'^ in magnitude. 

Finally, the input matrices G and J have been computed by the sequential Bunch- 
Parlett factorization with complete pivoting. The generation and preprocessing of matri- 
ces is not included in the times given below. 

A performance comparison of all described trigonometric and hyperbolic parallel Ja- 
cobi algorithms is given in Fig. S] and Fig. [5l respectively. The whole computation has 
been performed in IEEE double precision arithmetic. 

At the first glance more natural, the trigonometric Jacobi algorithms are, in fact, slower 
and less accurate than the hyperbolic ones. For large matrices, the orthogonality of the 
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Algorithm 3.3: Specially pivoted Cholesky factorization with respect to Jp 

Description: The algorithm computes the specially pivoted Cholesky factorization 
of the block matrix Ap oi order np, according to the number of 
positive signs npos in Jp. The computed factor Rs is returned in 
Ap, while the permutation P is returned in a separate vector. 

Input: Up, npos. 

Input /Output: Ap. 

Output: P. 

Jp_Pivoted_Cholesky(Ap, np, npos, P); 
begin 

if npos = np then // case J = I 

factorize An by the Cholesky factorization with diagonal pivoting; 

return R, P 
else if npos = then // case J = —I 

factorize ^422 by the Cholesky factorization with diagonal pivoting; 

reverse the columns of R and P; 

return R, P 
else // case J ^ ±1 

factorize An by the Cholesky factorization with diagonal pivoting; 

compute Ri2 from the triangular linear system RnRi2 = ^i^^i2) 

S = A22 — R12R12', 

factorize S by the Cholesky factorization with diagonal pivoting, 

apply the permutation P2 from the right to R12', 
reverse the columns of Rs,2 and P2; 
return Rs, P = diag(Pi,P2) 
end if 
end 
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Algorithm 3.4: Iterative part of hyperbolic algorithms HB, HBC, HBSC, HF, HFC, 

HFSC 

Description: Diagonalization of the pair {A, J) — {G*G, J) by the parallel 

one-sided hyperbolic Jacobi algorithms. Assumption: G is obtained 
by the Hermitian indefinite factorization of H with complete 
pivoting, and then reordered to get J partitioned as 
J — diag(/„pos, —In-npos)- The matrix G is initially divided into 2p 
block columns, distributed so that the pair of blocks {r + l,2p — r) 
resides in process r. In each process the first block column is 
denoted by index i, and the second one by j. 

Hyperbolic_ Jacobi (G, J, n); 
begin 
repeat 

// compute the pivot submatrix Ap 
compute Ap from (|3.4[) : 
Jp = diag(Jjj, Jjj); 

// compute a selected type of the Cholesky factorization of 

Ap 

if algorithm = HBSC or HFSC then 

I Jp_Pivoted_Cholesky(^p, R): 
else if algorithm = HBC or HFC then 

I Diagonally_Pivoted_Cholesky_with_Reordering(Ap, R); 
else 

I Unpivoted_Cholesky (Ap, i?) ; 
end if 

if algorithm = block- oriented then 

if this is the first step in a sweep, annihilate all the off-diagonal elements 
of Ap; 

for all the other steps, annihilate only the elements of the block Aij 
from (|3.4|) : 

accumulate the Jp-unitary matrix Vp* of applied transformations; 
else if algorithm = full block then 
diagonalize Ap; 

accumulate the Jp-unitary matrix Vf* that diagonalizes the pair 
{R*R, Jp); 
end if 

if algorithm = HBC, HFC, HBSC or HFSC then 

apply the permutation from the Cholesky factorization to the rows of 

Vp*; 
end if 

[G,G,] = [G,G,]-Vf*; 

send/receive one of the blocks in [Gi Gj] to/from the neighboring process 
according to the modulus strategy; 
until convergence; 
end 
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(a) full block 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
matrix ordcr/1000 



(b) block-oriented 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
matrix order/1000 



Figure 4: Comparison between trigonometric Jacobi algorithms on 64 processes. 

computed eigenvectors U, measured by \\UU'^ — I\\f and \\U'^U — I\\f, is approximately 
lO^^'"^ or less in the hyperbolic case, while it is not below 10"^^ in the trigonometric case. 
We believe that the repeated accumulation of U in each step contributes significantly to 
the overall error, while slowing down the algorithm. 

Our conclusion is that the hyperbolic parallel Jacobi algorithms are easier to imple- 
ment, and could be substantially faster than the trigonometric ones. 

The major speedup — dramatically reducing the number of sweeps, is obtained by 
the special pivoting, both in the Bunch-Parlett (for trigonometric) and in the Cholesky 
factorization (for hyperbolic algorithms). This effect is more visible in the full block case, 
and somewhat less in the block-oriented case. For example, on matrices of order 16000, 
the number of in-process sweeps is reduced from 17 in the "classical" HF implementation, 
to 11 in the HFSC algorithm, resulting in 30% speedup. 

Therefore, we expect that, given a similar computing architecture and matrices large 
enough, the performance of HFSC and HBSC should be superior to the other algorithms 
described herein. 

The timings shown in Fig. 2] and Fig. [5] are accurate in the sense that the successive 
runs of the same algorithm on the same data differ by less than a second, mostly due to 
the fact that all MPI communication is synchronous in our implementations. Moreover, 
for performance reasons, the threads should be assigned to only one processor core during 
the entire run. Otherwise, a thread could be rescheduled at any time to an another core, 
at the operating system's discretion. This move causes a severe cache invalidation and 
slowdown, which propagates through the entire run. 

Fig. [5] shows the portion of the total execution time spent in communication. The 
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(a) full block 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
matrix ordcr/1000 



(b) block-oriented 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
metrix order/1000 



Figure 5: Comparison between hyperbolic Jacobi algorithms on 64 processes. 

relatively high percentage is caused by the synchronous nature of the communication and 
synchronization — the observed execution time of a step is always the time of an MPI 
process with the slowest computation in that step. 

This also exposes a side-effect of dealing with indefinite matrices. If one block is 
definite, and the other one indefinite, their diagonalization times may differ by up to 
30%, which hurts the full block algorithms. It remains an open question whether the 
asynchronous communication patterns could hide these latencies in part. 

We have also compared our algorithms with the ScaLAPACK routine pdsygvx, which 
has been used to compute the eigenvalues of the real definite matrix pair [A, J), where 
A := G*G is a positive definite matrix (see Fig. [7]). To get a fair comparison, the compu- 
tation of A from G (pdsyrk) is not added to the total time. 

This comparison has been done on a cluster similar to the described one, with the 
main difference being the 10 Gb Ethernet interconnection. The parameters of pdsygvx 
have been chosen as recommended ones for the most accurate eigenvalue computation. 
We have experimentally found that 64 x 64 blocks is the fastest blocking option. 

The one-sided Jacobi algorithms inside each process can be replaced by other, faster 
algorithms, provided that these algorithms give numerically ( J-)orthogonal eigenvectors 
in the trigonometric (hyperbolic) case. Since the outer Jacobi method is self-correcting, 
these eigenvectors need not be very close to the exact ones. 

For example, in the full block trigonometric case, the inner (per block) Jacobi algo- 
rithm can be replaced by the tridiagonalization and divide-and-conquer algorithm. Such 
a modification (TFDC, for short) has been tested with the LAPACK dsyevd routine. The 
total running times for TFC and TFDC are given in Table [T] 
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(a) full block (HFSC) 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
matrix order/1000 

(b) block-oriented (HBSC) 




H 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1— 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
matrix order/1000 



Figure 6: Percentage of total time spent in communication. 

While the stopping criterion of the one-sided Jacobi algorithm is natural, there is no 
easy way to stop the whole algorithm when per block divide-and-conquer algorithm is 
employed. To obtain the comparison in Table [T] if the one-sided trigonometric full block 
algorithm uses s sweeps to convergence, we run s — 2 sweeps with per block divide-and- 
conquer algorithm, and the final sweeps (until convergence) with the Jacobi algorithm. A 
more detailed study is needed to determine the optimal stopping criterion with per block 
divide-and-conquer algorithm. 

Conclusion 

The presented parallel algorithms can be even faster, if we apply the sequential blocking 
for large matrices inside each process. These algorithms, called the three-level Jacobi 
algorithms, are the subject of another paper The switching point between the non- 
blocked and the blocked algorithm inside each process depends on various factors, such as 
the processor speed and organization, and the speed of the interconnection network. Once 
the hardware is fixed, profiling (like in the self-tuning packages, e.g., atlas) is needed to 
reveal that switching point for each algorithm. 

A Appendix 

Here we present a fioating-point error analysis for the accumulation of products of nearly 
orthogonal matrices, which is used to compute the eigenvectors in the trigonometric Jacobi 
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Figure 7: Time ratio between ScaLAPACK and HFSC, HBSC on 64 processes. 









Matrix 


sizes 




Routine 


processes 


4000 


8000 


12000 


16000 


TFC 


32 


67 


369 


1034 


2525 


TFDC 


32 


60 


316 


849 


1796 


TFC 


64 


53 


298 


749 


1492 


TFDC 


64 


45 


254 


697 


1373 



Table 1: Timing (in seconds) of TFC and TFDC algorithms. 



algorithms. 

We use the IEEE standard model of floating-point arithmetic 
fl(ao6) = (ao6)(l + eo), |eo| < e, 

where o is any of the four arithmetic operations, and e is the unit roundoff error. Addi- 
tionally, we also assume that the square roots can be computed with the same accuracy. 
In our analysis, numerically subscribed e's denote the quantities bounded by the unit 
roundoff error. 

In a pivot block which is assigned to a particular process, in each inner step, we choose 
and orthogonalize a pair of columns. For simplicity, we may assume that these inner steps 
occur simultaneously in all processes. This single simultaneous transformation of p pairs 
of columns will be called a stage. 

At the beginning of the algorithm, the eigenvector matrix is set to the identity matrix, 
i.e., [Z*^"-' :— I. The superscript denotes the stage index throughout the iterations, and the 
transformation [/(^^^) h> t/^^^ in stage £ consists of p independent trigonometric rotations. 



The i-th and j-th columns of the matrix U^^'^ are denoted by u\ and Uj , respectively. 

In addition, all the computed quantities are denoted by tilde, i.e., and Uj denote 

the i-th and j-th columns of the computed matrix U^^\ 

Let i and j be the indices of columns transformed in one of the processes in stage £. 
When the columns are transformed by a trigonometric rotation, the exactly computed 
new columns would be 



(e-i) (e-i) 



c s 
-s c 



CU\ — SUj , su] 



-^3 



(A.l) 



Instead of ()A.1|) . the new columns actually computed in the floating-point arithmetic are 



c s 

'S c 



(A.2) 
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Here, we assume that the columns u^l and u"^ have already accumulated some error 
from the previous transformations. In the formula (jA.2j) . c and s denote the computed 
cosine and sine, respectively, so 

c = (1 + ec)c, s = (1 + £s)s. 

Typically (see, for example, f9, equation (30)]), we have 

N<^, N<T3^. (A.3) 

The computed elements of the columns ul^-* and u^p at the end of stage i are 



~{e) 



(l + £c)(l + £i)c4' '^-(l+£.)(l+e2)s4j ''1(1 +£3) 



(! + £,)(! +£4)^4' '^ + (l + £c)(l + £5)c?ig ''J(l+e6). 
In other words, 

and the elementwise errors, linearized up to the term of order 0{s^), are 
Suil^ = s^'^cuf^ ~ e'^'hui'-'\ £(1) = £, + £1 + £3, £(^) = £. + £2 + £3, 

fcg=£(3),4^-l)+£(4)eug-l), £(3)^e.+£4 + £6, S^^^ ^ £c + £5 + ^6, ^^'^^ 

for k — 1, . . . ,n. 

The previous formula can be written in a matrix form, which summarizes all p trans- 
formations in stage i. Instead of the orthogonal matrix t/*^^-*, we have computed a slightly 
perturbed matrix U^^^ 

U^i) = fi([7(^-i)QW) = L^(^-i)qW + (A.5) 
where Q^^-* is the computed matrix of p independent rotations used in stage i, while 



is its exact and orthogonal counterpart. So, (jA.SP is a matrix equivalent of (|A.2p for all 
processes. 

Theorem 1. Let U^^^^ be the exact orthogonal matrix of accumulated transformations 
after i stages of the parallel trigonometric Jacobi algorithm, and let U^^^ be the computed 
matrix in the floating-point arithmetic. Then 

^ I 
||;7(^)-[/W||2< ^ ||5[/('")|l2, (A.6) 

m— 1 

where SU^"^^ is the perturbation matrix defined by M.5|) in stage m of the algorithm. 
Moreover, if the error in all computed cosines and sines is bounded by IIA.3\) . then 

||C7W - C/(^)||2 < 14£^V2P", (A.7) 
with the right-hand side linearized up to the term of order 0{e^). 

Proof. The proof follows by induction over the stage index £. Let Z^^^ be the error 
committed after £ stages of the algorithm, 

for any £ > 0. At the beginning of the first stage, we have U^'^'' = JJ^'^^ = /, and Z^*^) = 0, 
so both claims are obviously true for £ — 0. 
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Now suppose that the claim is vahd at the beginning of stage i > 1, i.e., the computed 
matrix accumulated so far is U^-^^^^ — U'^^^^^ + Z^^~^\ with 

||^^'-'^ll2< 5] ||<^f/('"^||2. (A.8) 

m— 1 

After the transformation in stage £ (in all processes), the computed new matrix is 
U^'^'> + Z^^\ and from (|X5l) it follows that 

= C/(^-i)qW+Z(^-i)qW+^C/W. 

bmce t/W = [/('^-i)Q('^) is the exact new matrix after £ stages (a matrix equivalent of 
dSIl)), the new error matrix can be written as 

The exact transformation matrix Q'--^^ is orthogonal, and by unitary invariance of the 
spectral norm, we get 

\\ZW\\, < ||z(^-i)||2 + ||<5C/(^)||2, (A.9) 

By the induction hypothesis ()A.8p . this completes the proof of (jA.6|) . 

To prove the second claim, we need to bound the perturbation SU^-^^ in (jA.5|) . We 
shall prove a slightly more general result, and (jA.7p will then follow by using the bounds 
in (TOIl . 

We assume that all the cosines and sines in the algorithm are computed with relative 
errors Sc and bounded by 

\ec\<£cos, \es\<esin, (A.IO) 

where ecos and Egin depend on a particular algorithm that is used to compute the elements 
of each trigonometric rotation. The only requirement is that both bounds must satisfy 

£cos = 0(e), £,i„ = 0(e), (A.ll) 

linearized up to the terms of order 0{e^), with small "hidden" constants, which may be 
different. In this setting, the bound in (jA.7|) becomes 

||C/W_t/W||2< (£,o, + + 4eKy2^, (A.12) 

which is again true for ^ — 0. 

Suppose that, at the beginning of stage £> I, the error Z^^~^^ in the computed matrix 
[/('^-i) is bounded by (|XT2|) . i.e., 

WZ'-'-'^h < (£cos + e.in + 4e)(^-l)v/2^. (A.13) 
From C/('^-i) ^ U^^^^'> + Z'^-i) it immediately follows that 

l|{7(^-i)||2 < ||c/(^-i)||2 + \\z^'-'m2 = 1 + \\z^'-'^h. 

Thus, by (jA.lip and (jA.13|) . all the elements of the perturbed matrix [/(^~^) satisfy 

l4'r'^l < l + 0(£), k,l = l,...,n. 



Since \c\,\s\ < 1, from (jA.4p it follows that the elementwise perturbations in the trans- 
formed columns i and j (in a particular process) can be bounded by 

l<s4?U'^4?l<(kc| + ks| + 4£)(i + o(£)). 
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Of course, if a column is not transformed in stage £, the corresponding perturbations are 
equal to zero. Let -E'^^ be a matrix of order n with the following column structure — a 
column of E^^^^ is equal to zero if this column is not transformed in stage i, otherwise it 
is equal to e, where e is a vector with all elements equal to one. Since exactly 2p columns 
are transformed in stage i, the matrix E^^^ has 2p columns equal to e. 

Let the symbol | | denote the pointwise absolute value of a matrix. From the above 
argument, by using (jA.10|) over all p processes, we get the following linearized bound for 
the perturbation SU^''^^ in (|A.5|) 

\6U^'^ < (ecos + £si„ + 4£)£;(^). 

For any two matrices A, B e M"^", if \A\ < B, then \\A\\2 < \\B\\2 (see [13 Lemma 
6.6.(b)]). By taking A = SU'^^\ and B ^ (Ecos + Esin + 4£)£'(^), it follows that 

m^'^h < (£cos+esin+4e)||i?(^)||2. 
Since E*^^^ is of rank one, we have \\E''^^\\1 = trace([£;(^)]^£;(^)) = 2pn, so 

||<5;7(^'|j2 < (ecos + esin + 4£)v/2^. 

Now, (|A.12|) follows immediately from (|A.9p and (|A.13|) . 

Finally, if we take the linearized bounds from ()A.3p . Ecos = £sin — Se, the rela- 
tion (jA.12p becomes 

_ [/W||2 < Ue£y/2^, 

which proves (|A.7|) . □ □ 

In the sequential case p — 1, the bound ()A.7j) in the spectral norm is similar to the 
perturbation bound (in the Frobenius norm) given by Dopico, Koev and Molera in [5J 
Theorem 6]. 

The departure from orthonormality of the computed eigenvector matrix U in the 
spectral norm can also be estimated by using the bound (jA.7|) . We have 

Wu'^u -ih = wu'^u -u'^uh = wu'^u - u^u + u'^u - u^u\\2 

= \\{U-UfU + U^{U-U)\\2 <\\U-U\\2- ||C/||2 + 1- IIC/-C/II2. (A.14) 

Since 

IIC/II2 - ||C/ - [/ + f/|l2 < llf/ - f/||2 + ||f/|l2 - lit/ - C/II2 + 1, 

then ()A.14|) simplifies to 

\\U^U -I\\2<\\U- UhiWU -U\\2 + 2). 

Now, if the total number of stages is known, one can use this and the formula (jA.7|) to 
bound the departure from orthonormality. 
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